Abstract

Speech recognition systems are often highly domain dependent,
a fact widely reported in the literature. However, the
concept of domain is complex and not bound to clear criteria.
Hence it is often not evident whether data should be considered to be
out-of-domain. While both acoustic and language models can
be domain specific, the work in this paper concentrates on acoustic
modelling. We present a novel method to perform unsupervised
discovery of domains using Latent Dirichlet Allocation (LDA)
modelling. Here a set of hidden domains is assumed to exist
in the data, whereby each audio segment can be considered to
be a weighted mixture of domain properties. The classification
of audio segments into domains allows the creation of domain
specific acoustic models for automatic speech recognition.
Experiments are conducted on a dataset of diverse speech data
covering speech from radio and TV broadcasts, telephone
conversations, meetings, lectures and read speech, with a joint training
set of 60 hours and a test set of 6 hours. Maximum A Posteriori
(MAP) adaptation to LDA based domains was shown to yield
relative Word Error Rate (WER) improvements of up to 16%
compared to pooled training, and up to 10% compared
with models adapted with human-labelled prior domain
knowledge.

Index Terms: domain discovery, latent dirichlet allocation, 
adaptation, speech recognition 

1. Introduction 

Recently, new applications and domains are becoming the target
of research in Automatic Speech Recognition (ASR), as
existing systems increase their accuracy. This has raised the
issue of how to scale up existing systems when new domains
are incorporated as target data, for instance “found data” such
as media and historical audio archives. In this situation, training
acoustic models for an unknown domain, like different YouTube
recordings, can be infeasible if the origin of the target speech
cannot be properly assessed, and the loss of accuracy is large
due to wrong modelling decisions.

Well-tailored single domain systems, where training data
that properly matches the target recognition data is available,
are mostly used in current speech recognisers. These domain
dependent models have usually been trained via Maximum
Likelihood (ML) if a sufficiently large amount of domain
data existed, or using adaptation techniques such as Maximum
A Posteriori (MAP) [1], Maximum Likelihood Linear Regression
(MLLR) [2] or Cluster Adaptive Training [3]. For more recent
Deep Neural Network (DNN)-based systems, domain adaptation
is also possible with linear transformations, conservative
training and subspace methods [4], with frameworks such as
Multi-Level Adaptive Networks (MLAN) [5] or Deep Maxout
Networks (DMN) [6].


An important issue when dealing with highly diverse
speech data is the difficulty of appropriately categorising every
speech input within a particular domain, especially
with newly discovered data. Even when domain categories have
been assigned manually by humans, these may be inaccurate, or there
may be hidden characteristics in the audio that further subdivide
these categories or cut across several of the predefined
domains. Developing the ability to discover these new and
hidden acoustic domains would greatly enhance the possibility
of using well-targeted domain specific models in ASR. However,
as most speech recognition tasks assume a single domain
or well differentiated domains, the task of unsupervised discovery
of acoustic domains in speech data has received little interest
so far. This paper proposes to open new areas for research in
multi-domain ASR by treating speech data as a set of documents
where latent domains exist and can be discovered using
Latent Dirichlet Allocation (LDA) models.

LDA is a statistical approach to discover latent topics in
a collection of documents in an unsupervised manner [7]. It is
mostly used in Natural Language Processing (NLP) for the
categorisation of text documents, but it has been used for audio and
image processing as well. In audio tasks, LDA has been used
for classifying unstructured audio files into onomatopoeic and
semantic descriptions with successful results [8, 9]. Building
on this knowledge, this work proposes to use LDA for domain
adaptation in ASR tasks.

This paper is organised as follows: Section 2 will give an
overview of LDA modelling in its original proposal for topic
modelling. Then, Section 3 will describe the proposed use
of LDA models for unsupervised domain discovery in speech
data. Section 4 will present the experimental setup used for
multi-domain speech recognition, with Section 5 detailing the
obtained results. Section 6 gives the conclusions to this work.

2. Latent Dirichlet Allocation 

Latent Dirichlet Allocation (LDA) [7] is an unsupervised
probabilistic generative model for collections of discrete data. It
aims to describe how every item within the collection is generated,
assuming that there is a set of hidden topics and that each
item is modelled as a finite mixture over those topics. Each topic
is, in turn, modelled as an infinite mixture over an underlying set
of topic probabilities [7]. LDA is mostly used for topic
modelling of text corpora; however, the model can be applied
to other tasks, such as object categorisation and localisation in
image processing [10], automatic harmonic analysis in music
processing [11] or acoustic information retrieval in unstructured
audio analysis.

In the context of text corpora, a dataset is defined as a
collection of documents and each document is a collection of
words. Given a vocabulary of size V, each word is represented
by a V-dimensional binary vector. It is assumed that the
documents are generated using the following generative process:

1. For each document $d_m$, $m \in \{1 \ldots M\}$, choose a
K-dimensional topic weight vector $\theta_m$ from the Dirichlet
distribution with scaling parameter $\alpha$:
$p(\theta_m \mid \alpha) = \mathrm{Dir}(\alpha)$

2. For each word $w_n$, $n \in \{1 \ldots N\}$, in document $d_m$:

(a) Draw a topic $z_n \in \{1 \ldots K\}$ from the multinomial
distribution $p(z_n = k \mid \theta_m)$

(b) Given the topic, draw a word from $p(w_n \mid z_n, \beta)$,
where $\beta$ is a $V \times K$ matrix and

$\beta_{ij} = p(w_n = i \mid z_n = j, \beta)$
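As a rough illustration, the generative process above can be sketched in Python using only the standard library. The function names, the toy vocabulary and topic sizes, and the symmetric Dirichlet parameter are assumptions made for this example and are not settings from the paper.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index i with probability probs[i]."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, n_words, rng):
    """Steps 1 and 2 of the process; beta[i][j] = p(w = i | z = j), a V x K matrix."""
    V = len(beta)
    theta = sample_dirichlet(alpha, rng)             # step 1: topic weights
    topics, words = [], []
    for _ in range(n_words):                         # step 2: one pass per word
        z = sample_categorical(theta, rng)           # step 2(a): draw a topic
        word_probs = [beta[i][z] for i in range(V)]  # column z of beta
        w = sample_categorical(word_probs, rng)      # step 2(b): draw a word
        topics.append(z)
        words.append(w)
    return theta, topics, words

# Toy example (assumed values): V = 4 words, K = 2 topics.
beta = [[0.5, 0.0],
        [0.5, 0.0],
        [0.0, 0.5],
        [0.0, 0.5]]
rng = random.Random(0)
theta, topics, words = generate_document([1.0, 1.0], beta, 20, rng)
```

In this toy setting topic 0 can only emit words 0 and 1, and topic 1 only words 2 and 3, so the sampled words reveal which topic produced them.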

Other assumptions include the bag-of-words property of
the documents and the fixed and known dimensionality of the
Dirichlet distribution K (and thus the dimensionality of the
topic variable z).

The graphical representation of the LDA model is shown in
Figure 1 as a three-level hierarchical Bayesian model. In this
model, the only observed variable is $w$ and the rest are all
latent. $\alpha$ and $\beta$ are corpus level parameters, $\theta_m$ are document
level variables and $z_n$, $w_n$ are word level variables. The
generative process is described formally as:

$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \qquad (1)$$
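To make Eq. (1) concrete, the following minimal sketch evaluates the logarithm of the joint probability for one document. The helper names and the toy values are assumptions for illustration; `beta[i][j]` is taken to be $p(w = i \mid z = j)$ as defined above.

```python
import math

def log_dirichlet_pdf(theta, alpha):
    """log Dir(theta | alpha)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def log_joint(theta, topics, words, alpha, beta):
    """log p(theta, z, w | alpha, beta) following Eq. (1)."""
    total = log_dirichlet_pdf(theta, alpha)              # log p(theta | alpha)
    for z, w in zip(topics, words):
        total += math.log(theta[z])                      # log p(z_n | theta)
        total += math.log(beta[w][z])                    # log p(w_n | z_n, beta)
    return total

# Toy check (assumed values): K = 2 topics, V = 2 words, a single word.
alpha = [1.0, 1.0]
beta = [[0.75, 0.25],
        [0.25, 0.75]]
value = log_joint([0.5, 0.5], [0], [1], alpha, beta)
```

With a uniform Dirichlet the prior term is zero in the log domain, so the value reduces to $\ln 0.5 + \ln 0.25$.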

The posterior distribution of the latent topic variables given the
words and the $\alpha$ and $\beta$ parameters is:

$$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)} \qquad (2)$$

Computing $p(\mathbf{w} \mid \alpha, \beta)$ requires intractable integrals. A
reasonable approximation can be obtained using variational
inference, which has been shown to work reasonably well in various
applications [7]. The approximated posterior distribution is:

$$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n) \qquad (3)$$

where $\gamma$ is the Dirichlet parameter that determines $\theta$ and $\phi$ is the
parameter for the multinomial that generates the topics.

Training tries to minimise the Kullback-Leibler divergence
(KLD) [12] between the real and the approximated posterior
distributions (Equations 2 and 3) [7]:

$$(\gamma^{*}, \phi^{*}) = \operatorname*{argmin}_{\gamma, \phi} \mathrm{KLD}\big(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\big) \qquad (4)$$

Other training methods based on Markov chain Monte Carlo,
such as Gibbs sampling [13], have also been proposed.


3. Unsupervised Domain Discovery

The proposed technique uses an LDA model to discover hidden
and latent acoustic domains in multi-domain speech data.
Since LDA models collections of discrete data (such as text
corpora) [7], every speech segment of length T frames,
$\mathbf{x} = \{x_1, \ldots, x_t, \ldots, x_T\}$, is represented as a set of discrete symbols
to support modelling within this framework. For that purpose,
the n-dimensional audio frames, $x_t \in \mathbb{R}^n$, are quantised into
a dictionary of V acoustic “words”, $\hat{x}_t \in \{1 \ldots V\}$. First,
a Gaussian Mixture Model (GMM) is trained using Expectation
Maximisation (EM) and a mix-up procedure to reach the desired
codebook size V (constraining the covariance matrix to be the
identity, which is equivalent to LBG-VQ [14]). Then the means of the
Gaussian components are used to create the codebook and quantise
the audio frames into discrete symbols. The assignment of
frame $x_t$ to codebook index j is performed using:

$$\hat{x}_t = \operatorname*{argmin}_{j} \lVert x_t - m_j \rVert, \quad j \in \{1 \ldots V\} \qquad (5)$$

where $m_j$ is the mean vector of the jth mixture component.

To reconcile this with the LDA terminology described in
Section 2, in this work each audio segment is a “document”
and each codified audio frame is a “word”. All the audio segments
(now “documents”) then create a whole “collection” or
“corpus”.

Once all the audio frames are converted to discrete “words”,
the parameters of the LDA model using K domains are estimated
on the M audio segments from the training data using
variational EM. The domain of each quantised audio segment $\mathbf{x}$
is then given by the domain with the highest value of the posterior
Dirichlet parameter $\gamma$ for that segment:

$$\mathrm{Domain}(\mathbf{x}) = \operatorname*{argmax}_{j} \gamma_j, \quad j \in \{1 \ldots K\} \qquad (6)$$

Based on the estimated parameters from the training set,
Dirichlet parameters $\gamma$ can be inferred for the test set segments
as well. With every segment in both train and test sets associated
to a hidden domain, it is possible to perform training
and/or adaptation with the usual techniques. Acoustic models
can be trained via Maximum Likelihood (ML), or domain
specific models can be adapted via MAP or MLLR, in the case of
GMM/HMM systems.
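The quantisation of Eq. (5) and the domain assignment of Eq. (6) can be sketched as follows. This is a minimal illustration under stated assumptions: the codebook and the word-given-domain matrix `beta` are toy values taken as given rather than estimated with variational EM, the function names are hypothetical, and the update inside `infer_gamma` is the standard variational E-step for LDA, which the text above does not spell out.

```python
import math

def quantise(frames, codebook):
    """Eq. (5): index of the nearest codebook mean for every frame.
    Squared distance is used, which gives the same argmin as the norm."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(range(len(codebook)), key=lambda j: sq_dist(x, codebook[j]))
            for x in frames]

def digamma(x):
    """Digamma function via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))

def infer_gamma(words, alpha, beta, n_iter=50):
    """Posterior Dirichlet parameters for one segment, with beta (V x K) fixed."""
    K = len(alpha)
    gamma = [a + len(words) / K for a in alpha]
    for _ in range(n_iter):
        weights = [math.exp(digamma(g)) for g in gamma]
        new_gamma = list(alpha)
        for w in words:
            phi = [beta[w][k] * weights[k] for k in range(K)]
            norm = sum(phi)
            for k in range(K):
                new_gamma[k] += phi[k] / norm
        gamma = new_gamma
    return gamma

def domain_of(gamma):
    """Eq. (6): index of the largest posterior Dirichlet parameter."""
    return max(range(len(gamma)), key=lambda j: gamma[j])

# Toy data (assumed values): V = 4 codewords in 2 dimensions, K = 2 domains.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 6.0]]
beta = [[0.45, 0.05], [0.45, 0.05], [0.05, 0.45], [0.05, 0.45]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.1], [5.8, 6.1]]
words = quantise(frames, codebook)
gamma = infer_gamma(words, [0.1, 0.1], beta)
domain = domain_of(gamma)
```

Three of the four toy frames fall on codewords that the first domain favours, so the segment is assigned to that domain.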


4. Experimental setup 

To evaluate the proposed domain discovery and adaptation 
method in a multi-domain and diverse ASR task, a dataset of 6 
different types of data was chosen from the following sources: 

• Radio (RD): BBC Radio 4 broadcasts from February 2009.

• Television (TV): Broadcasts from the BBC from May 2008.

• Telephone speech (CT): From the Fisher corpus* [15].

• Meetings (MT): From the AMI [16] and ICSI [17] corpora.

• Lectures (TK): From TedTalks [18].

• Read speech (RS): From the WSJCAM0 corpus [19].

A subset of 10h from each domain was selected to form the
training set (60h in total), and 1h from each domain was used for
testing (6h in total). The selection of the domains aims to cover
the most common and distinctive types of audio recordings used
in ASR tasks.

*All of the telephone speech data was up-sampled to 16 kHz to
match the sampling rate of the rest of the data.

Two types of acoustic features were used: first, 13 PLP
features plus first and second derivatives, for a total of
39-dimensional feature vectors; and second, a 65-dimensional
feature vector concatenating the 39 PLP features and 26 bottleneck
features (PLP+BN) extracted from a 4-hidden-layer DNN
trained on the full 60 hours of data. To form the DNN input,
31 adjacent frames (15 frames to the left and 15 frames to the
right) of 23-dimensional log Mel filter bank features were
concatenated to form a 713-dimensional super vector; a Discrete
Cosine Transform (DCT) was applied to this super vector to
de-correlate and compress it to 368 dimensions, and the result
was fed into the neural network. The
network was trained on 4,000 triphone state targets and the
26-dimensional bottleneck layer was placed before the output layer.
The objective function used was frame-level cross-entropy and
the optimisation was done with stochastic gradient descent and
the backpropagation algorithm. DNN training was performed
with the TNet toolkit [20] and more details can be found in [21].
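The construction of the 713-dimensional super vector (31 frames of 23 filter bank values) can be illustrated with a short sketch. The function name and the handling of segment edges, where the first or last frame is repeated, are assumptions; the DCT compression and the network itself are not shown.

```python
def splice(frames, left=15, right=15):
    """Concatenate every frame with its left and right context frames."""
    T = len(frames)
    spliced = []
    for t in range(T):
        vector = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), T - 1)  # repeat edge frames (assumption)
            vector.extend(frames[idx])
        spliced.append(vector)
    return spliced

# 40 dummy frames of 23 log Mel filter bank values each (illustrative data).
frames = [[float(t)] * 23 for t in range(40)]
supervectors = splice(frames)
```

Each output vector has 31 × 23 = 713 values, matching the dimensionality quoted above.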
For both types of features, baseline ML GMM-HMM models
were trained using HTK [22] with 5-state crossword triphones
and 16 Gaussians per state. The language model used
was based on a 50,000-word vocabulary and was trained by
combining language models from the 6 domains, with interpolation
weights tuned using an independent development set.

4.1. Baseline results 

Table 1 presents the baseline Word Error Rate (WER) results for
the maximum-likelihood (ML) model trained with
the pooled 60 hours of all domains, plus the results of ML
in-domain models, each trained with 10 hours of in-domain data. It
also includes the MAP adapted models from the pooled model
to each domain. Experiments were conducted using PLP and
PLP+BN features. ML training on the limited
in-domain data underperformed MAP adaptation on the same data,
which set MAP as the preferred setup for domain adaptation.


Table 1: WER (%) of baseline models

Features  Model      RS    RD    TK    CT    MT    TV    Total
PLP       ML         17.3  18.4  34.1  46.6  44.0  51.1  36.0
PLP       ML Domain  16.9  19.1  35.1  44.4  44.0  52.9  36.3
PLP       MAP        14.6  16.8  31.8  43.5  40.4  49.6  33.6
PLP+BN    ML         13.0  13.3  23.5  33.5  32.2  42.0  26.8
PLP+BN    ML Domain  12.6  14.0  25.0  34.3  33.2  44.0  27.9
PLP+BN    MAP        12.1  12.8  23.1  32.5  30.6  41.5  26.2


5. Results 

The experiments performed aimed to evaluate two aspects of the
proposed LDA modelling for unsupervised domain discovery.
First, whether LDA could be successfully used to find hidden domains
and whether these domains represented the hidden characteristics of
the audio. Second, once hidden domains had been identified, whether
domain adaptation could be applied to them and whether improvements
in ASR performance were achieved over the baselines.

5.1. Unsupervised domain discovery 

To use LDA models as described in Section 3, two parameters
had to be set up initially. First, the number of domains
K to be found had to be decided prior to training. Also,
since the audio frames needed to be quantised, the size of the
codebook V needed to be defined. To this end, a set of
experiments was conducted with different codebook sizes and
numbers of domains. Codebooks of size 128 up to 8,192 were
used and, given a codebook, different LDA models with a varying
number of domains from 4 to 64 were estimated [23, 24]
using the training data described in Section 4.

Since these identified domains were latent, there was no
ground truth to verify them at this stage. An initial way of
evaluating how the different latent domains behaved was by
measuring the distribution of the data, according to manual labels,
which was included in each hidden domain. Figure 2 presents
this distribution for an acoustic codebook of size 2,048 and 8
hidden domains. From this figure, it is possible to see how
telephone speech was separated into two different hidden
domains (D1 and D3), while meeting speech was mostly assigned
to a single hidden domain (D7). Other manually labelled
domains, such as Radio and Television broadcasts, were scattered
across hidden domains (D2, D4 or D8), indicating the presence
of previously unseen domains within these types of data.

Figure 2: Amount of data for each discovered domain (K = 8)
from the labelled domains using a codebook size of 2,048
(x-axis: discovered domains D1 to D8)
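The kind of breakdown plotted in Figure 2 amounts to a cross-tabulation of manual labels against discovered domains. A minimal sketch is given below; the segment list is hypothetical and is not the data used in the paper, and the function name is an assumption.

```python
from collections import Counter, defaultdict

def hours_per_domain(segments):
    """segments: iterable of (manual_label, discovered_domain, duration_seconds).
    Returns, for each discovered domain, the hours contributed by each manual label."""
    table = defaultdict(Counter)
    for label, domain, seconds in segments:
        table[domain][label] += seconds / 3600.0
    return table

# Hypothetical segments, for illustration only.
segments = [
    ("CT", "D1", 1800), ("CT", "D3", 1800),
    ("MT", "D7", 3600),
    ("RD", "D2", 900), ("TV", "D2", 2700),
]
table = hours_per_domain(segments)
```

Reading one entry of `table` gives one bar of such a plot, split by manual label.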

Following this, KL divergence [12] was proposed as an
appropriate metric to measure the consistency of the hidden topics
discovered by LDA. This measured how much the distributions of data
in latent domains, as in Figure 2, differed between different sets,
for instance training and testing data:

$$\mathrm{KLD}(P \,\|\, Q) = \sum_{i} P(i) \ln \frac{P(i)}{Q(i)} \qquad (7)$$

where P and Q are the distributions for training and test data.
To compute the divergence, since we deal with counts in the
distributions and some counts can be zero, the distributions are
smoothed by discounting 3% of the total mass and distributing
it across zero counts.
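Eq. (7) with the smoothing step can be sketched as follows. Sharing the discounted 3% equally among the zero-count bins is an assumption, since the text does not say how the mass is split; the function names and counts are illustrative.

```python
import math

def smooth(counts, discount=0.03):
    """Convert counts to probabilities. If any count is zero, move `discount`
    of the total mass onto the zero bins, shared equally (assumption)."""
    total = float(sum(counts))
    n_zero = sum(1 for c in counts if c == 0)
    if n_zero == 0:
        return [c / total for c in counts]
    return [discount / n_zero if c == 0 else (1.0 - discount) * c / total
            for c in counts]

def kl_divergence(p_counts, q_counts):
    """Eq. (7): sum_i P(i) ln(P(i) / Q(i)) on the smoothed distributions."""
    p = smooth(p_counts)
    q = smooth(q_counts)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Illustrative per-domain counts for a training and a test set.
train_counts = [10, 30, 0, 60]
test_counts = [12, 28, 5, 55]
divergence = kl_divergence(train_counts, test_counts)
```

A divergence close to zero would indicate that training and test data are spread over the latent domains in a similar way.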

Figure 3 shows the divergence values of different configurations.
Low values of divergence indicated a more consistent
set of hidden domains found by LDA modelling and, thus, were
preferred over configurations with higher values. In terms of
codebook size, codebooks of 2,048 and 8,192 symbols resulted
in lower divergence. For the number of domains, increasing to
more than 12 resulted in an increase in divergence.

Figure 3: KL divergence of training and test set topics
(x-axis: number of latent domains)

5.2. Domain adaptation

To evaluate the possibilities offered by the unsupervised
discovery of domains in ASR, MAP domain adaptation
was performed to each of these new domains. The experiments
were conducted with 4, 6, 8, 10 and 12 domains and a
codebook of acoustic words of size 2,048. Each MAP adapted
domain specific model was used to decode the corresponding
speech segments in the test set that were assigned to that
domain. Figure 4 shows the overall WER on the test set with
different numbers of topics using both types of features, PLP and
PLP+BN. The lowest WER values, 30.4% for PLP features and
25.4% for PLP+BN, were achieved with 8 domains for both
types of features, which was a 16% and 5% relative improvement
over their respective ML baselines. Compared with MAP
adaptation to human-labelled domains, the relative WER reduction
was 10% and 3%. The improvements in WER vanished for
more than 8 hidden domains, indicating that using larger numbers
of domains was not beneficial for this task.
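The relative improvements quoted above follow from the total WER values in Tables 1 and 2. The short sketch below recomputes them; the function and variable names are chosen for this example.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Total WER values taken from Tables 1 and 2.
plp_vs_ml = relative_reduction(36.0, 30.4)   # PLP: pooled ML vs LDA MAP
plp_vs_map = relative_reduction(33.6, 30.4)  # PLP: manual-label MAP vs LDA MAP
bn_vs_ml = relative_reduction(26.8, 25.4)    # PLP+BN: pooled ML vs LDA MAP
bn_vs_map = relative_reduction(26.2, 25.4)   # PLP+BN: manual-label MAP vs LDA MAP
```

Rounded to whole percentages, these reproduce the 16%, 10%, 5% and 3% figures reported in the text.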

Table 2 presents the breakdown of the results using 8 hidden
domains across the manually labelled domains. With PLP features,
improvements occur across all of these domains, indicating that the LDA
model can benefit all types of speech in this setup. The domains
that achieved the highest gains from using LDA MAP adaptation
(with PLP features) were read speech, telephone speech and
TV broadcasts, with relative WER reductions of 14%, 12% and
10% respectively compared to MAP adaptation on the manually
labelled domains. The lowest gain, 4% relative, occurred
on meeting speech. Similarly, with PLP+BN features, telephone
speech, lectures and read speech benefited the most, with relative
WER reductions of 5%, 4% and 2% respectively.


Table 2: WER (%) of LDA MAP Models (K = 8)

Features  Model    RS    RD    TK    CT    MT    TV    Total
PLP       MAP      14.6  16.8  31.8  43.5  40.4  49.6  33.6
PLP       LDA MAP  12.5  15.3  29.1  38.2  38.5  44.7  30.4
PLP+BN    MAP      12.1  12.8  23.1  32.5  30.6  41.5  26.2
PLP+BN    LDA MAP  11.9  12.8  22.3  31.1  31.0  41.0  25.4


Finally, Table 3 shows the WER across the hidden domains
for both types of features with LDA MAP models. The most
relevant feature of these domains, in terms of WER, was that
the domains of low WER (like read speech) or high WER (like
TV data) had been broken up into different hidden domains and
hence, WERs across hidden domains were more evenly distributed.

Table 3: WER (%) of LDA MAP Models (K = 8) across hidden domains

Features  D1    D2    D3    D4    D5    D6    D7    D8    Total
PLP       37.3  34.9  39.7  39.2  24.6  17.1  38.7  22.9  30.4
PLP+BN    33.9  29.2  30.4  32.8  19.7  12.6  30.9  19.2  25.4


Figure 4: WER (%) of LDA MAP adapted models with different
numbers of topics


6. Conclusions

A novel technique based on Latent Dirichlet Allocation (LDA)
has been proposed to discover latent domains in highly diverse
speech data in an unsupervised manner. The data set consisted
of data from TV and radio shows, meetings, lectures, read speech and
telephone speech, with a 60-hour training set and a 6-hour test
set. It was assumed that there is a set of hidden domains and that
each audio segment is a mixture of different properties of those
hidden domains with different weights. LDA models were used
to discover the latent domains and then these domains were used
to perform Maximum A Posteriori (MAP) domain adaptation.
With the LDA discovered domains, results showed relative
improvements of up to 16% over the baseline Maximum Likelihood
trained models and up to 10% over the models MAP adapted to
human-labelled domains.

The bag-of-words assumption in the LDA model does not take
the order of words into account. In applying LDA to image
processing, there are some variants of the original LDA model,
such as Spatial LDA [25], which encodes spatial structure with
the visual words. A temporal variant of LDA could better handle
the temporal nature of speech and needs to be investigated
as future work. Also, applying the current technique to bigger
and/or less diverse data sets needs to be investigated, to see what
the newly discovered domains would be and how they relate
to domain adaptation. Newer sets of features, better targeted
to describe background acoustic characteristics [26], could also
provide an improvement over PLP features, which are known to
describe phonetic and speaker information well.